Introduction

Source

The dataset was sourced from the Anime dataset on Kaggle.

What is an anime?

Anime is hand-drawn or computer-generated animation originating from Japan. Outside Japan, the term refers specifically to animation produced in Japan, whereas in Japan it describes all animated work, regardless of style or origin. The anime industry consists of over 430 production companies, including major studios such as Studio Ghibli, Sunrise, and Toei Animation. As of 2016, Japanese anime accounted for 60% of the world’s animated television shows.

Motivation

As an anime licensor, we buy the rights to sell and stream anime. Licenses have a fixed duration, and when a license expires it is not always renewed. For example, if a show is not popular enough to make a profit, a company may simply let the license lapse because there is not enough demand for it.

Goal of the project

The aim of our project is to predict, where possible, the Popularity of an anime from several features we have access to: the genre, type, number of episodes, studio, source, duration, and rating of the anime. Popularity is an important feature for us as a licensor: it represents the rank of each anime based on the number of users who added it to their list. In other words, an anime with high popularity has more success, is watched more, and attracts more people than an anime with low popularity.

Data

To predict the popularity we used the anime dataset. It contains thirty-five variables with information about 17,562 anime and the preferences of 325,772 users. It includes each anime’s status (dropped, completed, plan to watch, currently watching), the ratings given by users, and information about the anime such as genre, studios, and type.

For our project we will use the following variables:

  • MAL_ID: Id of the anime

  • Name: Name of the anime

  • Genders: Genre of the anime, a categorical (nominal) feature with 32 levels.

  • Type: Types of anime (TV, movie…), categorical feature with 7 levels.

  • Episodes: Number of episodes, numerical feature.

  • Producers: Producer of the anime, categorical feature with 71 levels.

  • Studios: Studios, categorical feature with 642 levels.

  • Source: Sources of anime (Manga, Light novel, Book…), categorical feature with 16 levels.

  • Duration: Duration of the anime per episode, numerical feature.

  • Rating: Age rating of the anime, categorical feature with 7 levels:

    • G: All Ages
    • PG: Children
    • PG-13: for teens 13 or older
    • R: 17+ (violence & profanity)
    • R+: Mild Nudity
    • Rx: Hentai
    • Unknown: no rating available

Data Wrangling

This section focuses on data wrangling: the process of transforming “raw” data into another format with the intent of making it more appropriate and valuable for our analysis. The goal of data wrangling is to ensure quality and useful data. As we saw in the previous section, some categorical variables have many levels; levels with very low frequencies are therefore merged into a common level, such as “others”, as they are not of specific interest for our analysis.

Indeed, all the levels with fewer than 100 anime are merged into the level “others”. For example, in the original dataset the variable Studios had 642 levels, but some studios have only one anime; these studios are not of specific interest and are grouped together.
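The merging rule can be sketched as follows. This is a minimal Python illustration (the report’s wrangling is done in R), and the helper name `merge_rare_levels` is our own:

```python
from collections import Counter

def merge_rare_levels(values, min_count=100, other="others"):
    """Replace levels that occur fewer than min_count times by a common label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]
```

Applied with `min_count=100` to a column such as Studios, this collapses the long tail of one-anime studios into a single “others” level.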

The following table groups together all the important variables that we will use during our project.

Exploratory Data Analysis

In this section, we conduct the Exploratory Data Analysis. This part is essential: it is important to critically explore the data and check the underlying distribution of each variable before fitting different models, as it gives us a first overview of the data and helps us understand it better. To do this, we first carry out a univariate analysis, looking at the variables one by one. After that, we perform a multivariate analysis to see whether there are links between the variables.

Univariate Analysis

In this section, we graphically represent some of the variables that we considered interesting to observe individually.

Gender

Gender is a nominal feature with 21 levels (after the merging of rare levels during data wrangling). The graph below shows that Action is the most represented genre across the dataset, followed by Comedy.

Type

Type is a nominal feature with 7 levels representing the format of the anime. Unlike the genre, it tells us whether the anime is broadcast as a TV series, released as a music video, and so on. The largest anime type in the database is TV, followed by OVA (Original Video Animation) and movie.

Episodes

What is interesting to observe in the treemap below is that most anime have only one episode. This may be because these anime are movies. We also notice that many anime have twenty episodes. Below are the ten most common episode counts.

In the histogram, we can see that on average an anime has seventy-one episodes, represented by the red line. Anime with between one and twenty episodes are the most frequent in the dataset. One outlier can be observed: one anime has thousands of episodes.

Producers

Producers is a nominal feature with twenty levels. NHK and TV Tokyo are the producers with the most anime. This is not surprising, since NHK WORLD-JAPAN is the Japanese company that manages Japan’s public-service radio and television stations and is the country’s only public broadcasting group. Nevertheless, the database holds many anime whose producer is “others” or “unknown”, which are not of interest for this analysis.

Studios

The Studios variable is a nominal feature with twenty-two levels. Toei Animation is the studio that produces the most anime; it is the studio behind the famous long-running anime One Piece.

Source

The nominal feature Source has sixteen levels and defines the material from which the anime was adapted. Most anime are originals, meaning they were not adapted from a book, for example. In second place, anime are adapted from manga.

Duration

This variable represents the duration of an episode of the anime. We observe that episodes mostly last less than 30 minutes.

The histogram shows that, on average, an anime episode lasts twenty-five minutes, and most anime have episodes shorter than that. We can also notice that there is a lot of variance in episode duration.

Rating

The rating is a nominal feature that defines the intended audience of an anime, in particular the recommended minimum age of the viewer. The letters correspond to:

  • G: General Audiences (All ages admitted)
  • PG: Parental Guidance Suggested
  • PG-13: Parents Strongly Cautioned
  • R, R+ and Rx: restricted because of violence, inappropriate language or pornographic content

We observe here that there is a majority of anime that are produced for adults.

Popularity

Popularity is the numerical variable we want to predict. The histogram of Popularity is left-skewed (negatively skewed): it has a large number of occurrences in the upper value cells (right side) and few in the lower value cells (left side).

We can also assess the skewness of our data from the shape of the box plot. The box plot, however, looks fairly symmetric: the median is in the middle of the box, the whiskers are about the same on both sides, and there are no outliers, suggesting an approximately normal distribution. Data close to a normal distribution is beneficial for model building; linear regression, for example, is derived under the assumption of normally distributed errors.

Multivariate analysis

After performing the univariate analysis and observing the variables one by one, we now perform the multivariate analysis and test graphically the links between certain variables.

Correlation matrix

Here a correlation matrix summarizes the data. Each cell shows the correlation between two variables: the more intense the red, the stronger the positive correlation. Popularity is positively correlated with Studios, Producers and Source, and negatively correlated with Rating, meaning there is a relation between those features and the Popularity of the anime.
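As an illustration of how such a matrix is computed, the sketch below applies NumPy’s `corrcoef` to hypothetical numerically encoded columns (the report’s matrix is produced in R):

```python
import numpy as np

def correlation_matrix(columns):
    """Pairwise Pearson correlations for a dict of equal-length numeric columns.
    np.corrcoef expects one variable per row, hence the stacking below."""
    names = list(columns)
    mat = np.corrcoef(np.array([columns[n] for n in names], dtype=float))
    return names, mat
```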

Popularity of animes by duration and type

Here we observe that, across all types, anime whose episodes last less than thirty minutes are the most successful.

Popularity of anime by ratings

Here, it is interesting to observe that the most popular anime are those for all ages (rating G), followed by PG-13 rated anime (unsuitable for children under thirteen). R rated anime (restricted under 17) have the lowest popularity.

Popularity of anime by Genre

On this graph we notice that popularity varies greatly by genre. The most popular anime are comedies or action anime, while anime of the Harem, Ecchi or Military genres have very low popularity.

Popularity of anime by Type

We see a difference in popularity depending on the type of anime. The most popular are OVA, TV and movie anime, while the Special type has very low popularity. In addition, very few anime have an Unknown type.

Mean popularity by Studio

We see a difference in popularity depending on the studio. The most popular anime are produced by the Unknown, DLE, Nippon Animation and Shin_Ei_Animation studios, while anime produced by the Bones studio have the lowest popularity.

Mean popularity by Producer

The following graph shows the mean popularity of the anime by producer. It shows us that the NHK and Sanrio producers have the anime with the highest mean popularity, unlike the Aniplex and Genco producers that have the anime with the lowest mean popularity.

Distribution of the popularity and source

We can observe in this graph that anime are less popular when their source is Radio or Digital manga. When the source is Original, Manga or 4-koma manga, for example, the popularity is much higher.

Supervised learning analysis

Linear Regression Model

The first model implemented to predict the popularity of an anime is a linear regression. The advantage of this method is that it is simple to apply, and the coefficients can be interpreted.

First, the linear regression is fitted using the 8 features listed above (Gender, Type, Episodes, Producers, Studios, Source, Duration and Rating). To obtain the simplest model that best explains our outcome variable Popularity, we use stepwise variable selection (backward selection).

The final model contains all 8 features. In terms of interpretation of the coefficients, the popularity increases on average by 196 when the gender switches from Action (reference level) to Adventure, and by 1333 when the type switches from Movie to Music, etc.

To avoid overfitting, the database is split using 5-fold cross-validation. CV was chosen because the dataset contains enough observations and the prediction scores obtained were similar to those obtained with bootstrap.

#> Linear Regression 
#> 
#> 12765 samples
#>     8 predictor
#> 
#> No pre-processing
#> Resampling: Cross-Validated (5 fold) 
#> Summary of sample sizes: 10212, 10211, 10212, 10213, 10212 
#> Resampling results:
#> 
#>   RMSE  Rsquared  MAE 
#>   2808  0.693     2207
#> 
#> Tuning parameter 'intercept' was held constant at a value of TRUE

The final model has 89 coefficients (including the levels of each feature). As the R output shows, the model has an RMSE of 2808 and an R2 of 69%, which are relatively good scores. Moreover, the coefficients changed only slightly when using CV instead of a simple split into a test set (25% of the data) and a training set (75%). From the coefficients, the average change in Popularity when a feature moves from the reference level to any other level can be estimated: for example, switching from rating G (all ages) to rating PG-13 (teens 13 or older) decreases the popularity of an anime by 3269.75. Due to their number, the coefficients are not all presented in the report, but they can be analyzed in detail in the code.
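The 5-fold cross-validation estimate reported above can be sketched in Python (the report fits the model in R with caret); the data below is synthetic and the function name is our own:

```python
import numpy as np

def kfold_rmse(X, y, k=5, seed=42):
    """Estimate out-of-sample RMSE of a linear regression with k-fold CV."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    rmses = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        # add an intercept column and fit by least squares on the training folds
        Xtr = np.column_stack([np.ones(len(train)), X[train]])
        Xte = np.column_stack([np.ones(len(test)), X[test]])
        beta, *_ = np.linalg.lstsq(Xtr, y[train], rcond=None)
        pred = Xte @ beta
        rmses.append(np.sqrt(np.mean((y[test] - pred) ** 2)))
    return float(np.mean(rmses))
```

With well-specified synthetic data, the CV RMSE converges to the noise standard deviation, which is the kind of out-of-sample error caret’s resampling summary reports.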

Variable importance

The relationship between each predictor and the popularity can be evaluated to estimate the contribution of each variable to the model. In the case of linear regression, the absolute value of the t-statistic for each model parameter is used.
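As a sketch of this importance measure, the following Python function computes the absolute t-statistics of an ordinary least-squares fit from first principles (caret does this internally in R):

```python
import numpy as np

def ols_t_stats(X, y):
    """Absolute t-statistics of OLS coefficients (intercept in position 0)."""
    n, p = X.shape
    Xd = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    sigma2 = resid @ resid / (n - p - 1)          # residual variance estimate
    cov = sigma2 * np.linalg.inv(Xd.T @ Xd)       # covariance of the coefficients
    se = np.sqrt(np.diag(cov))
    return np.abs(beta / se)
```

A feature with real predictive power gets a large |t|, which is exactly how the importance ranking in the graph is built.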

The graph above shows that the rating level PG-13 contributes the most to the model. Overall, Rating is the most important feature for predicting the popularity of an anime.

In fact, after fitting a model without the feature Rating, the RMSE increases by 212 and the R2 decreases by 5%, showing that without the predictor Rating, the model performs worse.

Predictions with linear regression

The model is evaluated by making predictions on the test sets using CV.

The following graph shows the quality of the model. The predicted values follow the red line; nevertheless, there is still a lot of variance. To measure the performance of the model and to later compare the linear regression with the regression tree, the RMSE, MAE and R2 are computed on the test and training sets.

Performance of the linear regression

The scores computed on the test set are approximately equal to those on the training set, which confirms that there is no sign of overfitting. Moreover, even with CV the linear regression performs well: RMSE of 2835 and R2 of 69%.

Linear regression scores
Model RMSE R2 MAE
Test set Score 2836 0.690 2237
Training set Score 2790 0.696 2191

Regression Tree Model

The second model implemented to predict the popularity is a regression tree. The model is great for predictions and is intuitive, easy to explain and easy to interpret.

Regression tree with anova

The regression tree is built by splitting data_Total into dt.tr (80%, n = 13607) and dt.te (20%, n = 3400). As a first step we build a full tree, then perform 10-fold cross-validation to help select the optimal cost-complexity parameter cp.

#> 
#> Regression tree:
#> rpart(formula = Popularity ~ Gender + Type + Episodes + Producers + 
#>     Studios + Source + Duration + Rating, data = dt.tr, method = "anova")
#> 
#> Variables actually used in tree construction:
#> [1] Rating  Source  Studios Type   
#> 
#> Root node error: 3e+11/13607 = 3e+07
#> 
#> n= 13607 
#> 
#>     CP nsplit rel error xerror  xstd
#> 1 0.40      0       1.0    1.0 0.008
#> 2 0.09      1       0.6    0.6 0.007
#> 3 0.07      2       0.5    0.5 0.006
#> 4 0.03      3       0.4    0.4 0.005
#> 5 0.01      4       0.4    0.4 0.005
#> 6 0.01      5       0.4    0.4 0.005
#> 7 0.01      6       0.4    0.4 0.005
#> 8 0.01      7       0.4    0.4 0.005

There are 8 possible cp values in this model. The 1-SE rule is used to cut the branches that do not contribute enough to the prediction quality: take the tree with the lowest xerror, add its xstd, and cut at the simplest tree whose xerror is below this bound. Here, xerror + xstd = 0.4 + 0.005 = 0.405, so the tree with three splits (xerror = 0.4 and cp = 0.03) should be used.
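The 1-SE rule described above can be made concrete with a small Python sketch operating on rows shaped like rpart’s cp table:

```python
def one_se_rule(cp_table):
    """cp_table: rows of (cp, nsplit, xerror, xstd), as printed by rpart's printcp.
    Pick the simplest tree whose xerror is within one xstd of the minimum."""
    best = min(cp_table, key=lambda r: r[2])          # row with lowest xerror
    bound = best[2] + best[3]                         # xerror + xstd
    candidates = [r for r in cp_table if r[2] <= bound]
    return min(candidates, key=lambda r: r[1])        # fewest splits wins
```

Applied to the table printed above, this selects the 3-split tree with cp = 0.03, matching the choice made in the text.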

Here is the pruned tree:

This tree has 4 terminal nodes; it is simpler and shows that the most important variables are Rating and Studios. The first split is at Rating = [PG-13, R, R+, Rx]. The tree can be interpreted as follows: if the rating of the anime is PG-13, R, R+ or Rx, go left; if the studio is, for example, AP (A1-Pictures), go left again; the predicted popularity of the anime is then 4663.

The most important indicators of Popularity are Rating, followed by Studios, Gender and Source. Only the two most important features appear in the pruned tree.

Predictions with regression tree

The last step is to make predictions on the test set. The pruned tree yields an average prediction error (RMSE) of 3380 and an MAE of 2742 on the test set (the RMSE penalizes large errors more harshly). This is not too bad considering that the standard deviation of Popularity is 5069. The scores on the training set are approximately equal to those on the test set, showing that there is no overfitting.

Regression tree scores
Model RMSE R2 MAE
Test set Score 3380 0.56 2742
Training set Score 3355 0.56 2723

The graph shows the predicted against the observed popularity using the final pruned tree; the 4 possible predicted values do a decent job of binning the observations. The model predicts only one low popularity value (4663), one high popularity value (14000) and two medium popularity values (9392 and 9774). This tree simplifies our outcome variable considerably, unlike the linear regression, which predicts a different value for each anime according to its features.

Regression tree with Caret Package

The second regression tree is fitted with caret::train(), specifying method = "rpart". We will build the model using 10-fold cross-validation to optimize the hyperparameter CP and avoid overfitting. We are letting the model look for the best CP tuning parameter with tuneLength (this parameter defines the total number of parameter combinations that will be evaluated).

After fitting the model with a tuneLength of 10, the first cp (0.0095) is chosen; it produces the smallest RMSE (3305) with an R2 of 57%. As the graph below shows, the best-performing tree appears to be the unpruned one.

The final model obtained after tuning the cp is still visually interpretable and has 10 terminal nodes; the resulting tree is more precise than the one studied above.

The most important indicators of Popularity in this regression tree are Studios Unknown, followed by Producers Unknown and Rating Rx. Furthermore, the important variables differ from those of the previous regression tree: the same dataset can lead to different trees and different interpretations.

The interpretation of this tree is the following: if the studio is unknown (= 1), go right; if the Rating is Rx (restricted anime), go left, and the predicted Popularity is 8028. Conversely, if the studio and the producer are known, go left twice; if the episodes are longer than 21 minutes, go left; if the source is known, go left; and if the duration is less than 25 min, go left: the predicted popularity is 2835.

Predictions of the Regression Tree (Caret)

The RMSE, MAE and R2 are computed on the test and training sets. The scores computed on the training set (R2 = 0.571) are almost equal to those computed on the test set (R2 = 0.574), showing that the model performs well and that there is no overfitting.

The following graph represents the predicted values against the observed popularity.

The model predicts 10 different popularity values for the anime. This tree is much more precise than the previous one, which predicted only 4. To compare the prediction quality of the two regression trees and the linear regression, the following scoreboard can be used.

The table summarizes the scores of the linear regression and the two regression trees.

Both regression trees have approximately the same scores: one has a higher RMSE and lower MAE, and vice versa for the other. The MAE is less sensitive to individual large errors and favors models with lower overall error, unlike the RMSE. After plotting the residuals against the predicted values of each model, the model with the lowest RMSE and best R2 is chosen, because neither model makes very large errors.

To conclude, the linear regression is the best-performing model to predict the popularity of anime: it has the highest R2 and the lowest RMSE and MAE. Moreover, the linear regression predicts more precise values based on the anime features, while the regression trees only predict 4 or 10 distinct values. However, the regression tree fitted with caret is easier to interpret.

Classification Tree Model

The last model implemented to predict the popularity is a classification tree. The aim is to have a model with good prediction quality that is easy to interpret.

Based on the distribution of popularity, the outcome variable is divided into 3 groups: Low, Medium and High. Since we are more interested in popular anime, half of the anime are considered to have low popularity.
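The binning can be sketched as follows in Python (the report does this in R). We assume a 50/25/25 quantile split, consistent with the class counts reported below; which end of the Popularity scale counts as “Low” is an assumption of this sketch:

```python
import numpy as np

def popularity_levels(pop):
    """Assumed split: bottom half of the popularity values -> Low,
    next quarter -> Medium, top quarter -> High."""
    pop = np.asarray(pop, dtype=float)
    q50, q75 = np.quantile(pop, [0.50, 0.75])
    return np.where(pop <= q50, "Low", np.where(pop <= q75, "Medium", "High"))
```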

The data is separated into a training set and a test set. In the training set, 5639 instances represent low popularity, roughly twice as many as each of the other levels. This may lead to inaccurate predictions, but let’s check the performance first; the data will be balanced later on.

#> 
#>   High    Low Medium 
#>   2838   5639   2861

As expected, training data is not balanced.

Similar to the regression tree, a full tree is built first. The full tree has 8 branches. To obtain a better result, branches that do not contribute enough to the prediction quality will be cut.

#> 
#> Classification tree:
#> rpart(formula = PopLevel ~ Gender + Type + Episodes + Producers + 
#>     Studios + Source + Duration + Rating, data = df.tr)
#> 
#> Variables actually used in tree construction:
#> [1] Rating  Source  Studios Type   
#> 
#> Root node error: 5699/11338 = 0.5
#> 
#> n= 11338 
#> 
#>     CP nsplit rel error xerror  xstd
#> 1 0.28      0       1.0    1.0 0.009
#> 2 0.06      1       0.7    0.7 0.009
#> 3 0.02      2       0.7    0.7 0.009
#> 4 0.02      4       0.6    0.6 0.009
#> 5 0.01      6       0.6    0.6 0.009
#> 6 0.01      7       0.6    0.6 0.009

The root node misclassifies 5699 of the 11338 observations (50%). According to the 1-SE rule, keeping 4 or 7 splits makes no difference to the result, so we keep the shortest tree, with four splits. The cp argument in prune should be set to any value between the cp of the 7-split tree (0.01) and that of the 4-split tree (0.02). Thus, four splits are applied.

Let’s now look closer at the pruned tree. Rating appears to be one of the most important variables for predicting popularity, followed by Studios, especially when the anime are produced by DLE or TMS Entertainment. Anime produced by A-1 Pictures, SIC, Bones, etc. are more likely to be classified as low popularity. A more detailed interpretation follows after balancing the data.

Variable importance

The tree above is difficult to interpret because of its size; the following graph identifies the important variables of the model for predicting popularity.

As with the first regression tree and the linear regression, Rating is the most important indicator of popularity, followed by Studios and Source.

Predictions with the classification tree

Now we will make predictions with the classification tree.

The pruned tree reaches an accuracy of 0.694 and a Kappa of 0.479, as the R output below shows. Low popularity has a very high sensitivity, while Medium popularity has a very low sensitivity of 17%. High popularity has the highest balanced accuracy (0.81). Overall, the model performs relatively well.

#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction High  Low Medium
#>     High   1047   91    445
#>     Low     261 2638    698
#>     Medium  106  136    247
#> 
#> Overall Statistics
#>                                         
#>                Accuracy : 0.694         
#>                  95% CI : (0.681, 0.706)
#>     No Information Rate : 0.505         
#>     P-Value [Acc > NIR] : <2e-16        
#>                                         
#>                   Kappa : 0.479         
#>                                         
#>  Mcnemar's Test P-Value : <2e-16        
#> 
#> Statistics by Class:
#> 
#>                      Class: High Class: Low Class: Medium
#> Sensitivity                0.740      0.921        0.1777
#> Specificity                0.874      0.658        0.9434
#> Pos Pred Value             0.661      0.733        0.5051
#> Neg Pred Value             0.910      0.890        0.7793
#> Prevalence                 0.249      0.505        0.2452
#> Detection Rate             0.185      0.465        0.0436
#> Detection Prevalence       0.279      0.635        0.0863
#> Balanced Accuracy          0.807      0.789        0.5606

The confusion matrix shows the performance of the prediction, visualized in the following graph. The model performs best in predicting the Low popularity category (dark blue) and worst in predicting the Medium popularity class (light blue).
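Accuracy and Cohen’s kappa follow directly from the confusion matrix. The Python sketch below reproduces the figures reported above (0.694 and 0.479) from the printed matrix:

```python
import numpy as np

def accuracy_and_kappa(cm):
    """Overall accuracy and Cohen's kappa from a square confusion matrix."""
    total = cm.sum()
    po = np.trace(cm) / total                                  # observed agreement
    pe = (cm.sum(axis=1) * cm.sum(axis=0)).sum() / total**2    # chance agreement
    return float(po), float((po - pe) / (1 - pe))
```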

The F1 score reported here is the harmonic mean of the specificity and the sensitivity (note that the standard F1 score is the harmonic mean of precision and recall); the following table summarizes it for the different classes. The prediction for medium popularity clearly still needs improvement.

F1
F1_high 0.801
F1_medium 0.299
F1_low 0.768
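The per-class score used here (harmonic mean of sensitivity and specificity) can be recomputed from the confusion matrix; the Python sketch below reproduces the table above from the printed matrix:

```python
import numpy as np

def class_harmonic_scores(cm, classes):
    """cm[i, j]: predicted class i, true class j. Returns, per class, the
    harmonic mean of sensitivity and specificity (the 'F1' used in this report)."""
    total = cm.sum()
    scores = {}
    for k, name in enumerate(classes):
        tp = cm[k, k]
        actual = cm[:, k].sum()       # column total: true members of the class
        predicted = cm[k, :].sum()    # row total: predicted members of the class
        fn = actual - tp
        fp = predicted - tp
        tn = total - tp - fn - fp
        sens = tp / actual
        spec = tn / (tn + fp)
        scores[name] = 2 * sens * spec / (sens + spec)
    return scores
```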

Predictions with balanced data

We will now balance the training set to check whether the prediction results improve. To balance the classes, re-sampling is applied to increase the weight of the minority classes (Medium and High).

#> 
#>   High    Low Medium 
#>   5639   5639   5639
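The re-sampling step can be sketched as follows in Python (the report does this in R); the row format and the `label_of` accessor are hypothetical:

```python
import random

def oversample(rows, label_of, seed=0):
    """Resample minority classes with replacement up to the majority class size."""
    rng = random.Random(seed)
    by_class = {}
    for r in rows:
        by_class.setdefault(label_of(r), []).append(r)
    target = max(len(items) for items in by_class.values())
    out = []
    for items in by_class.values():
        out.extend(items)
        out.extend(rng.choices(items, k=target - len(items)))  # duplicates minority rows
    return out
```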

Modeling and prediction after re-sampling

Now we have a balanced dataset where each class has as many observations as the largest class in the original training set. The model is fitted and pruned again.

#> 
#> Classification tree:
#> rpart(formula = PopLevel ~ Gender + Type + Episodes + Producers + 
#>     Studios + Source + Duration + Rating, data = df.tr.res)
#> 
#> Variables actually used in tree construction:
#> [1] Rating  Source  Studios
#> 
#> Root node error: 11278/16917 = 0.7
#> 
#> n= 16917 
#> 
#>     CP nsplit rel error xerror  xstd
#> 1 0.36      0       1.0    1.0 0.005
#> 2 0.06      1       0.6    0.6 0.006
#> 3 0.04      2       0.6    0.6 0.006
#> 4 0.02      3       0.5    0.5 0.006
#> 5 0.01      4       0.5    0.5 0.006

This table suggests a 3-split tree with cp = 0.02: according to the 1-SE rule, the bound is 0.5 + 0.006 = 0.506, and the simplest tree whose xerror is below this bound is the tree with 3 splits.

This graph reveals that Rating and Studios are the most important variables. The tree is interpreted as follows: if the Rating is G, PG or Unknown, go left; if the studio of the anime is, for example, A-1 Pictures, go left; and finally, because the studio name is not DLE, TMS Entertainment or Unknown, go right. The predicted popularity is Medium.

#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction High  Low Medium
#>     High   1047   91    445
#>     Low      36 1957    245
#>     Medium  331  817    700
#> 
#> Overall Statistics
#>                                         
#>                Accuracy : 0.653         
#>                  95% CI : (0.641, 0.666)
#>     No Information Rate : 0.505         
#>     P-Value [Acc > NIR] : <2e-16        
#>                                         
#>                   Kappa : 0.467         
#>                                         
#>  Mcnemar's Test P-Value : <2e-16        
#> 
#> Statistics by Class:
#> 
#>                      Class: High Class: Low Class: Medium
#> Sensitivity                0.740      0.683         0.504
#> Specificity                0.874      0.900         0.732
#> Pos Pred Value             0.661      0.874         0.379
#> Neg Pred Value             0.910      0.735         0.819
#> Prevalence                 0.249      0.505         0.245
#> Detection Rate             0.185      0.345         0.123
#> Detection Prevalence       0.279      0.395         0.326
#> Balanced Accuracy          0.807      0.791         0.618

Compared to the previous results, re-sampling slightly decreased accuracy, from 0.694 to 0.653. Looking closely at the sensitivities, the Medium level improved remarkably, increasing from 0.18 to 0.504, while the sensitivity of Low popularity decreased by about 24% and that of High remained constant. The specificities moved in the opposite direction. Balanced accuracy improved overall.

In terms of F1 scores, the medium popularity prediction improved remarkably, from 0.299 to 0.597.

F1
F1_high 0.801
F1_medium 0.597
F1_low 0.777

Variable importance balanced data

After re-sampling, the importance of Type decreased considerably, while Producers appears more important than in the previous results.

Eliminating Type, Duration, Producers and Episodes to see whether accuracy improves

For the sake of simplicity of the final model, we include only the important variables. Eliminating Type, Duration, Producers and Episodes one by one, the R output shows that the accuracy of the final result is not affected. We therefore consider only Gender, Studios, Source and Rating to plot the tree.

#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction High  Low Medium
#>     High   1047   91    445
#>     Low      44 2246    356
#>     Medium  323  528    589
#> 
#> Overall Statistics
#>                                         
#>                Accuracy : 0.685         
#>                  95% CI : (0.673, 0.697)
#>     No Information Rate : 0.505         
#>     P-Value [Acc > NIR] : < 2e-16       
#>                                         
#>                   Kappa : 0.501         
#>                                         
#>  Mcnemar's Test P-Value : 6.3e-15       
#> 
#> Statistics by Class:
#> 
#>                      Class: High Class: Low Class: Medium
#> Sensitivity                0.740      0.784         0.424
#> Specificity                0.874      0.857         0.801
#> Pos Pred Value             0.661      0.849         0.409
#> Neg Pred Value             0.910      0.795         0.811
#> Prevalence                 0.249      0.505         0.245
#> Detection Rate             0.185      0.396         0.104
#> Detection Prevalence       0.279      0.467         0.254
#> Balanced Accuracy          0.807      0.821         0.612

Applying cross-validation, the model has the best accuracy and kappa at cp = 0.00293. We now apply this cp value together with the most important variables.

The final model, after tuning the cp and re-sampling, has 20 nodes and does not change in terms of accuracy and Kappa. That is why the simpler model fitted after re-sampling, with fewer branches, is kept.

To conclude, the classification trees before and after re-sampling perform relatively well, but the overall F1 score and the balanced accuracy are better after re-sampling. Classification trees provide a very intuitive graph. Nevertheless, an important limitation of classification trees is their instability: without setting a seed, the structure of the tree changes from run to run.

Unsupervised learning analysis

Principal component analysis

Principal Component Analysis (PCA) is an unsupervised, non-parametric statistical technique used for dimensionality reduction. In the context of our project, PCA is used to inspect the data and to find clusters and dependencies. PCA finds combinations of features that capture as much variance as possible.

Each level of the categorical features is assigned a number (e.g. Gender Action = 1, Gender Adventure = 2, …). The following table summarizes the dataset used to implement the principal component analysis.

The correlation matrix in the EDA part demonstrated the relationships between variables: for instance, Type and Studios are negatively correlated, while Studios and Source are positively correlated. This indicates that our dataset contains observations described by multiple inter-correlated variables, which can be combined into a smaller number of dimensions. To reduce the dimensionality of our data and visualize it graphically, we implement PCA next.

Circle of correlations

We will start by introducing the circle of correlations, which allows us to draw conclusions about correlations between variables. In PCA we are interested in the components that maximize the variance. Without feature scaling, features with larger values dominate the components regardless of their units, while features with smaller values are under-weighted. That is why the circle of correlations is fitted once the features are scaled.
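To illustrate why scaling matters before PCA, here is a small self-contained Python sketch on toy data (not the anime dataset): without scaling, the largest-valued feature dominates the first component almost entirely.

```python
# Illustrative sketch: effect of feature scaling on PCA. Toy data only.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Six independent features on wildly different scales.
X = rng.normal(size=(200, 6)) * np.array([1, 1, 1, 1, 100, 1000])

pca_raw = PCA().fit(X)                                   # unscaled features
pca_std = PCA().fit(StandardScaler().fit_transform(X))   # standardized features

print(pca_raw.explained_variance_ratio_[0])  # close to 1: one feature dominates
print(pca_std.explained_variance_ratio_[0])  # much smaller: variance is spread out
```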

This correlation circle shows the first two principal components. PC1 explains 20.8% of the variance of the data and PC2 explains 17.6%; in total, 38.4% of the variance is explained by these two components, which leaves a lot of unexplained variance. The closer a variable is to the circle of correlations, the better its representation on the factor map (and the more important it is for interpreting these components); Type and Duration are the closest to the circle. Variables like Rating and Producers, which are close to the center of the plot, are less important and less well explained by these two components.

PC1 is highly positively correlated with Source and Studios, although the quality of representation of Source on the factor map is not as good as that of Studios. PC1 is also positively correlated with Producers. Besides, PC1 is negatively correlated with Type, which confirms that Type and Studios are negatively correlated according to the correlation matrix.

PC2 is strongly negatively correlated with Duration and positively correlated with Type, Gender and Episodes. In addition, Duration has a long arrow, which indicates that it is well represented.

In order to visualize the contribution of each feature to the first two dimensions, the following bar plots are introduced. The result confirms the conclusions drawn from the circle of correlations.
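As an illustration of what these contribution bar plots compute (in R, this is what factoextra displays), here is a minimal Python sketch on toy data: a variable's contribution to a component is its squared loading expressed as a percentage of that component's total squared loadings.

```python
# Illustrative sketch: variable contributions to principal components.
# Toy data; the formula mirrors the standard PCA contribution definition.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
Z = StandardScaler().fit_transform(X)

pca = PCA(n_components=2).fit(Z)
# Loadings: correlations between the original variables and the components.
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)
# Contribution (in %) of each variable to each component.
contrib = 100 * loadings**2 / (loadings**2).sum(axis=0)
print(np.round(contrib, 1))  # each column sums to 100
```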

Individual biplot

The individual biplot represents all the animes along the two dimensions. Due to the huge number of observations, it is difficult to interpret this graph visually. However, two clusters can be observed. The first cluster, on the top left, is positively associated with PC2 and negatively associated with PC1. The second cluster contains many more observations and lies in the middle of the graph.

The Studios, Source and Duration values of anime 8207 are small, and its Type level is also low (it is negatively associated with both PC1 and PC2). Anime 11337, on the top left, has a high Type level, a low Duration, and low Studios and Source levels, since it is strongly positively correlated with PC2 and negatively correlated with PC1.

Since we need to reach 75% of representation of the data in order to reduce the dimensionality of the dataset and find dependencies between the features, the scree plot is the next thing to look at.

Scree plot

According to the scree plot, 5 dimensions are needed to reach 75.3% of representation of the data, showing that most of the features are independent. All variables contribute to at least one of the five dimensions. This means that the three biplots below represent more than 75% of the data. Due to the number of instances in the data, the biplots are not easy to read. However, they show two clusters of anime that are well separated by PC5 and, to a lesser extent, by PC3.
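The scree-plot reading above can be reproduced numerically: count how many components are needed for the cumulative explained variance to cross 75%. A minimal Python sketch on toy data (this does not reproduce the report's 75.3% figure):

```python
# Illustrative sketch: number of components needed for 75% cumulative variance.
# Toy data with roughly independent features, so several components are needed.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 9))
Z = StandardScaler().fit_transform(X)

cum = np.cumsum(PCA().fit(Z).explained_variance_ratio_)
n_needed = int(np.argmax(cum >= 0.75)) + 1  # first index reaching the threshold
print(n_needed, round(cum[n_needed - 1], 3))
```

With nearly independent features the cumulative curve rises slowly, so many components are required, mirroring the report's conclusion.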

Looking at the two-dimensional graphs, it is difficult to extract information because the clusters overlap across the two dimensions. However, most real-world datasets have inherently overlapping information, which could be best explained by overlapping clustering methods that allow one sample to belong to more than one cluster.

Thanks to the scree plot, we can now deduce that Rating and Source are the features contributing most to PC3. Producers and Gender contribute the most to dimension 4. Episodes, Gender and Source are the most important contributors to dimension 5.

PCA to represent clustering results

To finish the PCA analysis, we combine clustering and PCA: clusters are represented on the map of individuals. Based on the dendrogram and the previous biplots, we chose to represent 3 clusters along the two dimensions that explain the most variance in the data. The clusters are built using the Manhattan distance.

From this graph, we notice that the clusters are almost separated by the two dimensions:

  • Cluster 1: is the least well separated by the dimensions. In fact, there is more variation in cluster 1: it is located approximately in the middle of dimension 2 and spreads all along dimension 1. This means that anime in cluster 1 have an average duration and type overall, while their Studios, Producers and Source levels (features mostly explained by dimension 1) vary a lot.

  • Cluster 2: is negatively associated with dimension 2, meaning that anime in cluster 2 have a high duration and a low type level. Moreover, cluster 2 is positively correlated with dimension 1, so anime in cluster 2 have high Studios and Source levels.

  • Cluster 3: is well separated by dimension 2. It is highly positively correlated with PC2 and negatively correlated with PC1, meaning that anime in cluster 3 mostly have a low duration and a high type level.
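As a sketch of the procedure behind this map, here is a minimal Python illustration on toy data: hierarchical clustering with the Manhattan (cityblock) distance, cut into 3 clusters, with PCA scores providing the two coordinates for the cluster map.

```python
# Illustrative sketch: Manhattan-distance clustering displayed on the
# first two principal components. Toy data with three planted clusters.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(c, 0.5, size=(50, 4)) for c in (0, 3, 6)])
Z = StandardScaler().fit_transform(X)

# Hierarchical clustering on Manhattan distances, cut into 3 clusters.
dist = pdist(Z, metric="cityblock")
labels = fcluster(linkage(dist, method="average"), t=3, criterion="maxclust")

# PCA scores give the 2-D coordinates used to draw the map of individuals.
scores = PCA(n_components=2).fit_transform(Z)
print(sorted(set(labels)))  # -> [1, 2, 3]
```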

In order to deepen and ensure this analysis, we will classify the variables into 3 groups using the k-means clustering algorithm.

This circle of correlations represents the variables explaining the observations of each cluster. Cluster 1 is mostly explained by Gender, Producers, Studios and Source. In fact this cluster is represented in the middle of the biplot (see previous graph) and mostly along dimension 1. Cluster 2 is explained by some of the most contributing variables of dimension 2, Type and Episodes. Finally, cluster 3 is explained by Rating and Duration, Duration being highly correlated to dimension 2.

In conclusion, the PCA enabled us to inspect the data and deduce that 5 dimensions are needed to explain at least 75% of the variance. The fact that many dimensions are needed suggests that the features are largely independent; if they were highly correlated, a small number of dimensions would suffice. Moreover, the PCA helped identify 3 potential clusters in the data along dimensions 1 and 2, one cluster having a large PC1 and the two others a large PC2. Now that we have seen how to reduce the dimensionality of the dataset, this analysis can be complemented by other clustering methods. Indeed, the Manhattan distance is only valid for numerical variables; for a clustering algorithm to yield sensible results on mixed data, we could use a distance metric that handles mixed data types, such as the Gower distance.

Clustering

Unsupervised models cluster and analyze datasets without labels by finding patterns and groups. In our case, we decided to build the clusters using the Gower distance, because our dataset contains both categorical and numerical variables. After creating the distance matrix, we can look at the most similar pair, which turns out to be two Dragon Ball films. The most dissimilar pair is Aim for the Ace! (1979), which is a film, and One Piece, one of the most popular anime, still ongoing and with a large number of episodes.

MOST SIMILAR PAIR

     ID  Episodes  Duration  Popularity  Gender  Type  Producers  Studios  Source  Rating
    820         1        50        1494       1     1         19       20       7       3
    818         1        50        1505       1     1         19       20       7       3

MOST DISSIMILAR PAIR

     ID  Episodes  Duration  Popularity  Gender  Type  Producers  Studios  Source  Rating
    666         1        26        7660      20     1         19       20      14       1
    101       146        24          52       1     6          1        3       7       3
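The report computes these distances with R's cluster::daisy; as an illustration of how the Gower distance handles mixed data, here is a minimal hand-rolled Python sketch. The helper function, the toy rows (loosely echoing the pairs above) and the feature ranges are all hypothetical.

```python
# Illustrative sketch of the Gower distance for mixed numeric/categorical
# data: average of per-feature dissimilarities, each scaled into [0, 1].
def gower(a, b, numeric_ranges):
    """a, b: dicts of feature -> value; numeric_ranges: feature -> value range."""
    parts = []
    for feat, av in a.items():
        bv = b[feat]
        if feat in numeric_ranges:  # numeric: range-scaled absolute difference
            parts.append(abs(av - bv) / numeric_ranges[feat])
        else:                       # categorical: simple match/mismatch
            parts.append(0.0 if av == bv else 1.0)
    return sum(parts) / len(parts)

# Hypothetical toy rows, not actual dataset records.
x = {"Episodes": 1, "Duration": 50, "Type": "Movie"}
y = {"Episodes": 1, "Duration": 50, "Type": "Movie"}
z = {"Episodes": 146, "Duration": 24, "Type": "TV"}
ranges = {"Episodes": 145, "Duration": 26}

print(gower(x, y, ranges))  # identical rows -> 0.0
print(gower(x, z, ranges))  # maximally different rows -> 1.0
```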

Visualizing clusters for 1000 observations

We can picture the clusters with a plot showing them in two dimensions. However, our data contains too many observations for the capacity of our computers, so we limited the analysis to the first 1000 observations in order to plot them.

The silhouette graph indicates that we should use 5 clusters, since this value yields the highest average silhouette width.

In the plot below we can make out some clusters. However, in the middle of the graph the clusters overlap. Note also that with 5 clusters, the graphical analysis can be difficult to read near the cluster boundaries.

Visualizing clusters for the top 100 anime

To go further in our analysis, it is interesting to visualize the 100 most and least popular anime. Using the Gower distance, the data is easily separated into two categories: thanks to their differences in popularity, two clear clusters appear.

The separation is quite good, although two errors appear (two red points are grouped in the blue section).

Dendrogram

This is certainly the part of our cluster analysis where we can obtain the most valuable information. A dendrogram of the 20 most popular anime can help us picture some recommendations based on the differences between them. We ran this analysis on the 20 most popular anime, and the resulting grouping is, subjectively and according to our perception and knowledge, quite reliable.

To better understand this output, note that the lower two branches merge in the dendrogram, the more similar the corresponding anime are to their neighboring leaves.
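A minimal Python sketch of building such a dendrogram with hierarchical clustering (toy points with hypothetical labels, not the actual top-20 anime):

```python
# Illustrative sketch: hierarchical clustering and dendrogram leaf order.
# Toy 2-D points in two well-separated groups, with hypothetical labels.
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(c, 0.2, size=(4, 2)) for c in (0, 5)])
labels = [f"anime_{i}" for i in range(len(X))]

Z = linkage(X, method="complete")  # agglomerative, complete linkage
# no_plot=True: compute the tree layout without drawing it.
tree = dendrogram(Z, labels=labels, no_plot=True)

# Leaves merged low in the tree are the most similar observations.
print(tree["ivl"])  # leaf labels in dendrogram order
```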

A Silent Voice, Steins;Gate and Your Name (the blue section) are isolated, which is satisfying because A Silent Voice and Your Name are two very romantic films while the others are anime series. Steins;Gate could be a possible error, since subjectively it should be closer to the Re:Zero anime.

We can accept this dendrogram because seasons of the same anime are grouped together, and because of our subjective opinions. For example, Tokyo Ghoul and Shingeki no Kyojin are comparable thanks to their genre, as are Hunter x Hunter and Naruto, which have approximately the same fanbase and genre.

Conclusion

Due to the data size, we were not able to apply K-NN in the analysis, since it makes predictions based on the whole dataset. Naive Bayes wasn't chosen either, since this method assumes that all the variables are independent and relies on density estimation, which would be difficult for us to implement since most of our variables are categorical.

Among all the methods introduced above, linear regression and the classification tree provide quantitative results. As shown at the end of the regression tree part, linear regression is the best model to predict popularity according to the RMSE, R2 and MAE scores.

Rating is the most important contributor in both the linear regression model and the regression tree. The linear regression model shows that the rating level PG-13 is the key indicator of popularity, while Studio is one of the most important variables in the regression tree model.

The classification tree provides the most intuitive result, but again, trees are not stable, so we have to be careful not to over-interpret the results. Furthermore, data balancing is relatively important for this method: balanced accuracy and F1 score increased after re-sampling. The important variables in the classification tree are the same as those in the regression tree.

Regarding unsupervised learning, PCA enabled us to reduce the dimensionality and to identify 3 potential clusters along the first 2 dimensions, which explain most of the variance. Meanwhile, the features seem independent, as 5 dimensions are necessary to explain enough of the variance of the data.

Because of the large number of instances, we conducted the cluster analysis on 1000 anime. Even if this is not representative of our whole dataset, we were able to identify clear clusters. In fact, most real-world datasets have inherently overlapping information, so overlapping clustering methods could further improve this cluster analysis. The dendrogram analysis enabled us to obtain valuable information about the 20 most popular anime.

Recommendation

As an anime licensor, the goal of our project was to predict the popularity of anime. Thanks to the several models implemented, we identified Studio and Rating as decisive variables for an anime's popularity. More precisely, anime produced by an Unknown studio or by DLE Studio, with a rating of "G" (for all ages), are those with the highest popularity overall. In order to decide for which anime we should buy or renew licenses, it is therefore essential to be aware of the Studio, Rating and Producers of the anime, as these are the most essential indicators of an anime's popularity.